SAFE: Investigating AI Weather Forecasting Models with Stratified Assessments of Forecasts over Earth

by
Nick Masi

Artificial intelligence (AI) is revolutionizing weather forecasting, improving both accuracy and computational efficiency. However, these models have a fatal flaw: the standard measure of an AI forecasting model's quality is its average accuracy across all gridpoints over the globe. This approach is in line with the mathematical roots of the AI field, but it fails to capture the real-world impacts that drive our desire for accurate weather forecasting. To understand why, let's take a look at where on Earth model performance is worst. We will investigate GraphCast, Google's state-of-the-art deterministic AI weather forecasting model, and consider its ability to predict atmospheric temperature 3 days in advance, a common benchmark for such models.

The go-to metric for assessing the quality of these predictions is the root mean square error (RMSE). Typically, RMSE is averaged both temporally and geospatially, so you get a single number to report as the quality of your model. Convenient. We will begin by eliminating the reduction over the geospatial dimensions (latitude and longitude), allowing us to see the RMSE at each individual 1.5° by 1.5° cell across Earth. Throughout, model performance will be assessed on twice-daily temperature values over 2020. GraphCast predictions were retrieved from WeatherBench 2, and ground-truth temperature data from ECMWF's ERA5 dataset, made available on the Copernicus Climate Data Store.
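The per-gridcell calculation can be sketched in a few lines of xarray. This is a minimal illustration using synthetic stand-in arrays (the variable names and random data are mine, not drawn from WeatherBench 2): the squared error is averaged over time only, leaving a latitude-by-longitude map of RMSE, while the usual scalar score collapses that map with a latitude-based area weighting.

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for 3-day-lead forecasts and ERA5 truth:
# twice-daily temperature on a 1.5 degree grid (real data would
# come from WeatherBench 2 / the Copernicus Climate Data Store).
times = np.arange(732)              # 2 values/day for 366 days of 2020
lats = np.arange(-90, 90.1, 1.5)
lons = np.arange(0, 360, 1.5)
rng = np.random.default_rng(0)
shape = (times.size, lats.size, lons.size)
truth = xr.DataArray(rng.normal(280, 10, shape),
                     dims=("time", "lat", "lon"),
                     coords={"time": times, "lat": lats, "lon": lons})
forecast = truth + rng.normal(0, 1, shape)  # truth plus random error

# Per-gridcell RMSE: average the squared error over time only,
# keeping the latitude/longitude dimensions intact.
rmse_map = np.sqrt(((forecast - truth) ** 2).mean(dim="time"))

# The conventional single-number score then collapses this map
# with cosine-of-latitude area weights; the map shows what it hides.
weights = np.cos(np.deg2rad(rmse_map.lat))
scalar_rmse = float(rmse_map.weighted(weights).mean())
print(rmse_map.shape, scalar_rmse)
```

Everything downstream in this post starts from a map like `rmse_map` rather than the single `scalar_rmse` value.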

It is apparent that the model does not perform uniformly well across the globe. This is the pernicious result of using geospatially averaged accuracy as the one and only metric: unfair performance disparities get masked. Given that the accuracy of extreme heat forecasts has a direct effect on mortality, it is a matter of life and death to be aware of the relative strengths of different models at individual locations. Let's take a look at where in the world GraphCast performs worst.

We find a few notable outliers: Greece (GRC), Bulgaria (BGR), the Republic of North Macedonia (MKD), Türkiye (TUR), Albania (ALB), Kosovo (XKX), and Namibia (NAM). These are territories where GraphCast performs significantly worse than it does everywhere else. Having this sort of knowledge is important because it can help decision-makers in those territories judge whether tools like GraphCast are appropriate for their use.

The results from GraphCast have been interesting, but it remains an open question whether these disparities are an idiosyncrasy of GraphCast or are systemic across AI weather forecasting models. To explore this question, we will now consider five additional models: Google's Spherical CNN, Keisler's GNN, NeuralGCM, Huawei's Pangu-Weather, and FuXi. Additionally, let's look not just at their bias by territory, but also when grouping the territories by their global subregion and income level. We will characterize the unfairness in a model as the greatest difference in RMSE between any two groups for each attribute.
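The group-gap metric itself is simple to compute once each territory has an RMSE and a set of attribute labels. The sketch below uses a small hand-made table with illustrative numbers (not the actual results) to show the idea: average RMSE within each group of an attribute, then take the spread between the best- and worst-served groups.

```python
import pandas as pd

# Hypothetical per-territory RMSE values for one model, each tagged
# with grouping attributes. Numbers are illustrative only.
df = pd.DataFrame({
    "territory": ["GRC", "TUR", "NAM", "DEU", "JPN", "BRA"],
    "subregion": ["Southern Europe", "Western Asia", "Southern Africa",
                  "Western Europe", "Eastern Asia", "South America"],
    "income":    ["High", "Upper middle", "Upper middle",
                  "High", "High", "Upper middle"],
    "rmse":      [2.1, 2.3, 1.9, 1.1, 1.0, 1.3],
})

def max_group_gap(df: pd.DataFrame, attribute: str) -> float:
    """Unfairness score for one attribute: the largest difference in
    mean RMSE between any two groups of that attribute."""
    group_means = df.groupby(attribute)["rmse"].mean()
    return float(group_means.max() - group_means.min())

for attr in ("territory", "subregion", "income"):
    print(attr, max_group_gap(df, attr))
```

A gap of zero would mean the model serves every group equally well; the larger the gap, the more unevenly its accuracy is distributed.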

We see that it is not just GraphCast! Across all models, and across all three attributes, there are disparities in model performance. All other model prediction data also comes from WeatherBench 2.

SAFE is a new open-source package I have developed to facilitate all of the data exploration and fairness assessments conducted here; it was used to gather the per-attribute RMSE values in this exploration. Overall, the tool empowers decision-makers with the insight to choose the model that is most accurate for their location, and encourages AI developers to prioritize fairness in model performance by providing a convenient way to perform stratified assessments, breaking free of the single-metric paradigm.